[batch] fix g2 machine memory calculation #14498

sjparsa · 2024-04-23T17:32:42Z

A machine's memory is calculated from its number of cores, using a fixed ratio of RAM to cores. The code assumed 3.75 GB of RAM per core, but this assumption does not hold outside of the n1 machine family. This PR changes the functions that calculate memory from the number of cores so that they take the machine_type, rather than the worker type, as the main parameter. It then adds logic to use the appropriate ratio of 4 GB per core for the g2 family of machines.
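A minimal sketch of the idea, not the code in this PR: the function names, the lookup table, and the parsing of machine types as family-workertype-cores are all assumptions made for illustration. Only the two per-core ratios mentioned in the description are included, expressed in MiB.

```python
from typing import Dict, Tuple

# Hypothetical lookup table (not Hail's actual data structure).
# Keys are (machine family, worker type); values are MiB of RAM per core.
MEMORY_PER_CORE_MIB: Dict[Tuple[str, str], int] = {
    ("n1", "standard"): 3840,  # 3.75 GiB per core
    ("g2", "standard"): 4096,  # 4 GiB per core
}


def parse_machine_type(machine_type: str) -> Tuple[str, str, int]:
    """Split a name such as 'g2-standard-4' into (family, worker type, cores)."""
    parts = machine_type.split("-")
    if len(parts) != 3 or not parts[2].isdigit():
        raise ValueError(f"unsupported machine type: {machine_type!r}")
    family, worker_type, cores = parts
    return family, worker_type, int(cores)


def memory_per_core_mib(machine_type: str) -> int:
    """Look up the per-core ratio from the machine family, not the worker type alone."""
    family, worker_type, _ = parse_machine_type(machine_type)
    try:
        return MEMORY_PER_CORE_MIB[(family, worker_type)]
    except KeyError:
        raise ValueError(f"unsupported machine type: {machine_type!r}") from None


def total_memory_mib(machine_type: str) -> int:
    """Total memory is the core count times the family-specific ratio."""
    _, _, cores = parse_machine_type(machine_type)
    return cores * memory_per_core_mib(machine_type)
```

Keying the table on the machine family is what lets an n1 and a g2 machine with the same worker type resolve to different ratios. Unknown families raise an error rather than silently falling back to the n1 value.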

daniel-goldstein

Leaving my comments from our discussion yesterday. This looks great and I'm really happy we caught this bug. Just a few comments on the organization.

It would also be great to have some unit testing around the problematic function: gcp_worker_memory_per_core_mib. Take a look at batch/test/test_utils.py, it contains a couple unit tests that just test the behavior of a couple of individual functions. You can run that test file with pytest batch/test/test_utils.py. This would be a good place to add tests for gcp_worker_memory_per_core_mib. We should cover at least a handful of n1s and g2s and some invalid / not-yet-supported machine types that should raise some sort of error.
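The tests suggested above might look roughly like the sketch below. The function defined here is a stand-in so the example runs on its own: its signature, return values, and choice of ValueError are assumptions, not the real gcp_worker_memory_per_core_mib. In the actual test file the stand-in would be replaced by an import of the real function.

```python
def gcp_worker_memory_per_core_mib(machine_type: str) -> int:
    """Stand-in for the function under test (assumed behavior, not Hail's code)."""
    family, _, rest = machine_type.partition("-")
    worker_type = rest.split("-")[0]
    ratios = {"n1": 3840, "g2": 4096}  # assumed MiB per core for standard machines
    if worker_type != "standard" or family not in ratios:
        raise ValueError(f"unsupported machine type: {machine_type!r}")
    return ratios[family]


def test_n1_standard_machines():
    # A handful of n1 sizes should all share the 3.75 GiB-per-core ratio.
    for cores in (1, 2, 4, 8, 16):
        assert gcp_worker_memory_per_core_mib(f"n1-standard-{cores}") == 3840


def test_g2_standard_machines():
    # g2 machines should use the 4 GiB-per-core ratio instead.
    for cores in (4, 8, 12, 16):
        assert gcp_worker_memory_per_core_mib(f"g2-standard-{cores}") == 4096


def test_unsupported_machine_types_raise():
    # Invalid or not-yet-supported names should raise, not return a guess.
    for machine_type in ("e2-standard-4", "g2-highmem-8", "not-a-machine", ""):
        try:
            gcp_worker_memory_per_core_mib(machine_type)
        except ValueError:
            continue
        raise AssertionError(f"expected ValueError for {machine_type!r}")
```

pytest collects plain functions whose names start with test_, so these need no extra setup. The loops could also be written with pytest.mark.parametrize so that each machine type is reported as its own test case.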

batch/batch/cloud/gcp/resource_utils.py

batch/batch/cloud/azure/instance_config.py

batch/batch/cloud/gcp/instance_config.py

batch/batch/cloud/gcp/resource_utils.py

batch/batch/cloud/resource_utils.py

batch/batch/front_end/front_end.py

batch/batch/worker/worker.py

daniel-goldstein

This is looking really great! I think there's one spot that was missed w.r.t. hard-coding n1s and then a little more cleanup, but we're getting really close.

batch/batch/cloud/gcp/instance_config.py

batch/batch/inst_coll_config.py

daniel-goldstein

This is great, thanks for all your work fixing this sneaky bug!

sjparsa requested review from cseed and daniel-goldstein April 23, 2024 17:32

sjparsa force-pushed the gpu_fix branch from 0f580c0 to f7991b1 Compare April 25, 2024 21:20

daniel-goldstein changed the title ~~Gpu fix~~ [batch] fix g2 machine memory calculation Apr 26, 2024

daniel-goldstein suggested changes Apr 26, 2024

daniel-goldstein and others added 3 commits April 30, 2024 15:12

[batch] Get rid of machine_type_to_worker_type_cores

5e55aac

[batch] fix g2 machine memory calculation

9eefae7

resolved front jp instance manager

478f9c8

sjparsa force-pushed the gpu_fix branch from 7f131f5 to 478f9c8 Compare April 30, 2024 19:47

fix worker memory calls

3eeb65a

daniel-goldstein reviewed Apr 30, 2024

batch/batch/worker/worker.py (outdated, resolved)

Sophie Parsa added 2 commits April 30, 2024 17:06

fix memory calculation

4d2b98c

fix memory bytes calc in instance config

ae40c94

sjparsa requested a review from daniel-goldstein May 1, 2024 18:26

Sophie Parsa added 2 commits May 1, 2024 15:06

delete commented out code

1cbffd7

Merge branch 'main' into gpu_fix

a6eb31a

sjparsa force-pushed the gpu_fix branch from 57c7af8 to a6eb31a Compare May 1, 2024 19:11

Sophie Parsa added 3 commits May 1, 2024 15:16

simplify

3884d6c

fix worker mem calc

db1982e

lint fix

a78568d

daniel-goldstein suggested changes May 2, 2024

batch/batch/cloud/gcp/instance_config.py (outdated, resolved)

batch/batch/inst_coll_config.py (outdated, resolved)

Sophie Parsa added 2 commits May 2, 2024 21:16

review

1676a85

fix

10df17d

daniel-goldstein approved these changes May 3, 2024

hail-ci-robot merged commit c093074 into hail-is:main May 3, 2024
2 checks passed
